Saturating Auto-Encoders 



Rostislav Goroshin * 

Courant Institute of Mathematical Science 
New York University 

goroshin@cs.nyu.edu



Yann LeCun 

Courant Institute of Mathematical Science 
New York University 
yann@cs.nyu.edu



Abstract 



We introduce a simple new regularizer for auto-encoders whose hidden-unit ac- 
tivation functions contain at least one zero-gradient (saturated) region. This reg- 
ularizer explicitly encourages activations in the saturated region(s) of the corre- 
sponding activation function. We call these Saturating Auto-Encoders (SATAE). 
We show that the saturation regularizer explicitly limits the SATAE's ability to 
reconstruct inputs which are not near the data manifold. Furthermore, we show 
that a wide variety of features can be learned when different activation functions 
are used. Finally, connections are established with the Contractive and Sparse 
Auto-Encoders. 



1 Introduction 

An auto-encoder is a conceptually simple neural network used for obtaining useful data rep- 
resentations through unsupervised training. It is composed of an encoder which outputs a 
hidden (or latent) representation and a decoder which attempts to reconstruct the input using 
the hidden representation as its input. Training consists of minimizing a reconstruction cost 
such as $L_2$ error. However, this cost is merely a proxy for the true objective: to obtain a useful
latent representation. Auto-encoders can implement many dimensionality reduction techniques
such as PCA and Sparse Coding (SC) [6] [7]. This makes the study of auto-encoders very
appealing from a theoretical standpoint. In recent years, renewed interest in auto-encoder
networks has mainly been due to their empirical success in unsupervised feature learning [1] [3].

When minimizing only reconstruction cost, the standard auto-encoder does not typically learn any 
meaningful hidden representation of the data. Well known theoretical and experimental results 
show that a linear auto-encoder with trainable encoding and decoding matrices, $W^e$ and $W^d$
respectively, learns the identity function if $W^e$ and $W^d$ are full rank or over-complete. The linear
auto-encoder learns the principal variance directions (PCA) if $W^e$ and $W^d$ are rank deficient.
It has been observed that other representations can be obtained by regularizing the latent
representation. This approach is exemplified by the Contractive and Sparse Auto-Encoders [3] [1] [2].
Intuitively, an auto-encoder with limited capacity will focus its resources on reconstructing portions 
of the input space in which data samples occur most frequently. From an energy-based perspective,
auto-encoders achieve low reconstruction cost in portions of the input space with high data density
(recently, [8] has examined this perspective in depth). If the data occupies some low-dimensional
manifold in the higher-dimensional input space, then minimizing reconstruction error achieves low
energy on this manifold. Useful latent state regularizers raise the energy of points that do not lie 
on the manifold, thus playing an analogous role to minimizing the partition function in maximum 
likelihood models. In this work we introduce a new type of regularizer that does this explicitly for 
auto-encoders with a non-linearity that contains at least one flat (zero gradient) region. We show
examples where this regularizer and the choice of nonlinearity determine the feature set that is learned
by the auto-encoder. 

2 Latent State Regularization 

Several auto-encoder variants which regularize their latent states have been proposed; they include
the sparse auto-encoder and the contractive auto-encoder [1] [2] [3]. The sparse auto-encoder
includes an over-complete basis in the encoder and imposes a sparsity-inducing (usually $L_1$) penalty
on the hidden activations. This penalty prevents the auto-encoder from learning to reconstruct all 
possible points in the input space and focuses the expressive power of the auto-encoder on
representing the data manifold. Similarly, the contractive auto-encoder avoids trivial solutions by
introducing an auxiliary penalty which measures the squared Frobenius norm of the Jacobian of the latent
representation with respect to the inputs. This encourages a constant latent representation except
around training samples, where it is counteracted by the reconstruction term. It has been noted in [3]
that these two approaches are strongly related. The contractive auto-encoder explicitly encourages 
small entries in the Jacobian, whereas the sparse auto-encoder is encouraged to produce mostly zero 
(sparse) activations which can be designed to correspond to mostly flat regions of the nonlinearity, 
thus also yielding small entries in the Jacobian.

2.1 Saturating Auto-Encoder through Complementary Nonlinearities 

Our goal is to introduce a simple new regularizer which explicitly raises reconstruction error for 
inputs not near the data manifold. Consider activation functions with at least one flat region; these 
include shrink, rectified linear, and saturated linear (Figure 1). Auto-encoders with such
nonlinearities lose their ability to accurately reconstruct inputs which produce activations in the zero-gradient
regions of their activation functions. Let us denote the auto-encoding function $x_r = G(x, W)$, $x$
being the input, $W$ the trainable parameters in the auto-encoder, and $x_r$ the reconstruction. One can
define an energy surface through the reconstruction error: 

$$E_W(x) = \|x - G(x, W)\|^2$$

Let's imagine that $G$ has been trained to produce a low reconstruction error at a particular data
point $x^*$. If $G$ is constant when $x$ varies along a particular direction $v$, then the energy will grow
quadratically along that particular direction as $x$ moves away from $x^*$. If $G$ is trained to produce
low reconstruction errors on a set of samples while being subject to a regularizer that tries to make 
it constant in as many directions as possible, then the reconstruction energy will act as a contrast 
function that will take low values around areas of high data density and larger values everywhere 
else (similarly to a negative log likelihood function for a density estimator). 
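
The quadratic growth described above can be made explicit with a short worked expansion. It assumes that $G$ is exactly constant along $v$, i.e. $G(x^* + t v, W) = G(x^*, W)$ for a scalar step $t$; the symbols $t$ and $r = x^* - G(x^*, W)$ (the residual at the data point) are notation introduced here for illustration only.

```latex
\begin{aligned}
E_W(x^* + t v) &= \| x^* + t v - G(x^* + t v, W) \|^2 \\
               &= \| r + t v \|^2 \\
               &= \|r\|^2 + 2 t \, r^\top v + t^2 \|v\|^2 .
\end{aligned}
```

When the reconstruction at $x^*$ is accurate, $r \approx 0$ and the energy is approximately $t^2 \|v\|^2$, which is the quadratic growth along $v$.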

The proposed auto-encoder is a simple implementation of this idea. Using the notation
$W = \{W^e, B^e, W^d, B^d\}$, the auto-encoder function is defined as

$$G(x, W) = W^d F(W^e x + B^e) + B^d$$

where $W^e$, $B^e$, $W^d$, and $B^d$ are the encoding matrix, encoding bias, decoding matrix, and decoding
bias, respectively, and $F$ is the vector function that applies the scalar function $f$ to each of its
components. $f$ will be designed to have "flat spots", i.e. regions where the derivative is zero (also
referred to as the saturation region).

The loss function minimized by training is the sum of the reconstruction energy $E_W(x) = \|x - G(x, W)\|^2$
and a term that pushes the components of $W^e x + B^e$ towards the flat spots of $f$. This is
performed through the use of a complementary function $f_c$, associated with the non-linearity $f(z)$.
The basic idea is to design $f_c(z)$ so that its value corresponds to the distance of $z$ to one of the
flat spots of $f(z)$. Minimizing $f_c(z)$ will push $z$ towards the flat spots of $f(z)$. With this in mind,
we introduce a penalty of the form $f_c\left(\sum_{j=1}^{d} W^e_{ij} x_j + b^e_i\right)$, which encourages the argument to be
in the saturation regime of the activation function ($f$). We refer to auto-encoders which include
this regularizer as Saturating Auto-Encoders (SATAEs). For activation functions with zero-gradient
regime(s) the complementary nonlinearity ($f_c$) can be defined as the distance to the nearest saturation
region. Specifically, let $S = \{z \mid f'(z) = 0\}$; then we define $f_c(z)$ as:

Figure 1: Three nonlinearities (top) with their associated complementary regularization functions (bottom).



$$f_c(z) = \inf_{z' \in S} |z - z'|. \qquad (1)$$
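
As an illustration of Equation 1, the sketch below writes out the complementary functions for the three nonlinearities named in Figure 1. The parameterization is an assumption made here, not taken from the paper: shrink is taken to be the soft-threshold with width `lam`, and saturated-linear is taken to be linear on [0, 1] and flat outside. All function names are hypothetical.

```python
import numpy as np

# Assumed forms of the three nonlinearities (illustrative parameterization).
def shrink(z, lam=1.0):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

def sat_linear(z):
    return np.clip(z, 0.0, 1.0)

# Complementary functions: distance from z to the nearest zero-gradient region.
def shrink_c(z, lam=1.0):
    # The flat region of shrink is [-lam, lam].
    return np.maximum(np.abs(z) - lam, 0.0)

def relu_c(z):
    # The flat region of the rectified-linear function is z <= 0.
    return np.maximum(z, 0.0)

def sat_linear_c(z):
    # Flat regions are z <= 0 and z >= 1; inside (0, 1) take the nearer edge.
    return np.clip(np.minimum(z, 1.0 - z), 0.0, None)
```

Under these assumptions each complementary function is zero exactly where the corresponding activation has zero gradient and grows linearly with the distance from that region.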

Figure 1 shows three activation functions and their associated complementary nonlinearities. The
complete loss to be minimized by a SATAE with nonlinearity $f$ is:



$$L = \sum_{x \in D} \frac{1}{2} \left\| x - \left( W^d F(W^e x + B^e) + B^d \right) \right\|^2 + \alpha \sum_{i=1}^{d_h} f_c(W^e_i x + b^e_i), \qquad (2)$$

where $d_h$ denotes the number of hidden units. The hyper-parameter $\alpha$ regulates the trade-off between
reconstruction and saturation.
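
The loss in Equation 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `satae_loss`, the row-per-sample layout of `X`, and the soft-threshold form of shrink with width `lam` are assumptions made here.

```python
import numpy as np

def shrink(z, lam=1.0):
    # Assumed soft-threshold form of the shrink nonlinearity.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def shrink_c(z, lam=1.0):
    # Distance from z to the flat region [-lam, lam] (Equation 1).
    return np.maximum(np.abs(z) - lam, 0.0)

def satae_loss(X, We, be, Wd, bd, f, f_c, alpha):
    """Equation 2 summed over the rows of X (one sample per row).

    We has shape (d_h, d), Wd has shape (d, d_h).
    """
    Z = X @ We.T + be            # pre-activations, shape (n, d_h)
    X_rec = f(Z) @ Wd.T + bd     # reconstructions, shape (n, d)
    recon = 0.5 * np.sum((X - X_rec) ** 2)
    saturation = np.sum(f_c(Z))  # penalty is applied to the pre-activations
    return recon + alpha * saturation
```

Note that the penalty is evaluated on the pre-activations $W^e x + B^e$, so it costs one extra elementwise pass over the $d_h$ hidden units per sample.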



3 Effect of the Saturation Regularizer 

We will examine the effect of the saturation regularizer on auto-encoders with a variety of activation 
functions. It will be shown that the choice of activation function is a significant factor in determining 
the type of basis the SATAE learns. First, we will present results on toy data in two dimensions 
followed by results on higher dimensional image data. 

3.1 Visualizing the Energy Landscape 

Given a trained auto-encoder the reconstruction error can be evaluated for a given input x. For 
low-dimensional spaces ($\mathbb{R}^n$, where $n < 3$) we can evaluate the reconstruction error on a regular
grid in order to visualize the portions of the space which are well represented by the auto-encoder.
More specifically, we can compute $E(x) = \frac{1}{2}\|x - x_r\|^2$ for all $x$ within some bounded region of
the input space. Ideally, the reconstruction energy will be low for all $x$ which are in the training set
and high elsewhere. Figures 2 and 3 depict the resulting reconstruction energy for inputs $x \in \mathbb{R}^2$,
and $-1 < x_i < 1$. Black corresponds to low reconstruction energy. The training data consists of
a one-dimensional manifold shown overlain in yellow. Figure 2 shows a toy example for a SATAE
which uses ten basis vectors and a shrink activation function. Note that adding the saturation
regularizer decreases the volume of the space which is well reconstructed; however, good reconstruction
is maintained on or near the training data manifold. The auto-encoder in Figure 3 contains two

Figure 2: Energy surfaces for unregularized (left) and regularized (right) solutions obtained on
SATAE-shrink and 10 basis vectors. Black corresponds to low reconstruction energy. Training 
points lie on a one-dimensional manifold shown in yellow. 



Figure 3: SATAE-SL toy example with two basis elements. Top row: three randomly initialized
solutions obtained with no regularization. Bottom row: three randomly initialized solutions obtained
with regularization.

encoding basis vectors (red), two decoding basis vectors (green), and uses a saturated-linear activa- 
tion function. The encoding and decoding bases are unconstrained. The unregularized auto-encoder 
learns an orthogonal basis with a random orientation. The region of the space which is well recon- 
structed corresponds to the outer product of the linear regions of two activation functions; beyond 
that the error increases quadratically with the distance. Including the saturation regularizer induces 
the auto-encoder basis to align with the data and to operate in the saturation regime at the extreme 
points of the training data, which limits the space which is well reconstructed. Note that because the 
encoding and decoding weights are separate and unrestricted, the encoding weights were scaled up 
to effectively reduce the width of the linear regime of the nonlinearity. 
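
The grid evaluation described in this subsection can be sketched as follows. The helper `energy_grid` and the stand-in model `project_onto_diagonal` are hypothetical: the stand-in simply projects each point onto the line $x_1 = x_2$, playing the role of a trained auto-encoder whose data manifold is that line.

```python
import numpy as np

def energy_grid(reconstruct, lo=-1.0, hi=1.0, steps=101):
    """Evaluate E(x) = 0.5 * ||x - x_r||^2 on a regular 2-D grid."""
    axis = np.linspace(lo, hi, steps)
    xx, yy = np.meshgrid(axis, axis)
    points = np.stack([xx.ravel(), yy.ravel()], axis=1)  # shape (steps**2, 2)
    residual = points - reconstruct(points)
    energy = 0.5 * np.sum(residual ** 2, axis=1)
    return axis, energy.reshape(steps, steps)

def project_onto_diagonal(points):
    # Hypothetical stand-in for a trained auto-encoder.
    v = np.array([1.0, 1.0]) / np.sqrt(2.0)
    return np.outer(points @ v, v)

axis, E = energy_grid(project_onto_diagonal)
```

The array `E` can then be rendered as an image, with low values along the diagonal and energy growing quadratically with the distance from it.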

3.2 SATAE-shrink 

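Consider a SATAE with a shrink activation function and shrink parameter $\lambda$. The corresponding
complementary nonlinearity, derived using Equation 1, is given by:

$$\mathrm{shrink}_c(x) = \begin{cases} |x| - \lambda, & |x| > \lambda \\ 0, & \text{otherwise} \end{cases}$$

Note that $\mathrm{shrink}_c(W^e x + b^e) = \mathrm{abs}(\mathrm{shrink}(W^e x + b^e))$, which corresponds to an $L_1$ penalty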
on the activations. Thus this SATAE is equivalent to a sparse auto-encoder with a shrink activation 
function. Given the equivalence to the sparse auto-encoder we anticipate the same scale ambiguity 
which occurs with $L_1$ regularization. This ambiguity can be avoided by normalizing the decoder
weights to unit norm. It is expected that the SATAE-shrink will learn similar features to those
obtained with a sparse auto-encoder, and indeed this is what we observe. Figure 4(c) shows the
decoder filters learned by an auto-encoder with shrink nonlinearity trained on gray-scale natural 
image patches. One can recognize the expected Gabor-like features when the saturation penalty is 
activated. When trained on the binary MNIST dataset the learned basis is comprised of portions of 
digits and strokes. Nearly identical results are obtained with a SATAE which uses a rectified-linear 
activation function. This is because a rectified-linear function with an encoding bias behaves as a
positive-only shrink function; similarly, the complementary function is equivalent to a positive-only
$L_1$ penalty on the activations.
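
Both equivalences stated in this subsection can be checked numerically. The sketch below assumes the soft-threshold form of shrink and an illustrative value `lam = 0.7`; neither is taken from the paper.

```python
import numpy as np

lam = 0.7                              # illustrative shrink parameter
z = np.linspace(-3.0, 3.0, 601)        # range of pre-activations to test

shrink_out = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
shrink_c = np.maximum(np.abs(z) - lam, 0.0)   # distance to [-lam, lam]

# The complementary function of shrink equals the L1 magnitude of its output.
same_as_l1 = np.allclose(shrink_c, np.abs(shrink_out))

# A rectified-linear unit with encoding bias -lam is the positive half of shrink.
relu_biased = np.maximum(z - lam, 0.0)
same_as_positive_shrink = np.allclose(relu_biased, np.maximum(shrink_out, 0.0))

# Its complementary function (distance to the region z - lam <= 0) is the
# activation itself, i.e. a positive-only L1 penalty.
relu_c = np.maximum(z - lam, 0.0)
same_as_positive_l1 = np.allclose(relu_c, np.abs(relu_biased))
```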

3.3 SATAE-saturated-linear 

Unlike the SATAE-shrink, which tries to compress the data by minimizing the number of active
elements, the SATAE saturated-linear (SATAE-SL) tries to compress the data by encouraging the
latent code to be as close to binary as possible. Without a saturation penalty this auto-encoder learns 
to encode small groups of neighboring pixels. More precisely, the auto-encoder learns the identity 
function on all datasets. An example of such a basis is shown in Figure 4(b). With this basis the
auto-encoder can perfectly reconstruct any input by producing small activations which stay within 
the linear region of the nonlinearity. Introducing the saturation penalty does not have any effect 
when training on binary MNIST. This is because the scaled identity basis is a global minimizer of 
Equation 2 for the SATAE-SL on any binary dataset. Such a basis can perfectly reconstruct any 
binary input while operating exclusively in the saturated regions of the activation function, thus 
incurring no saturation penalty. On the other hand, introducing the saturation penalty when training 
on natural image patches induces the SATAE-SL to learn a more varied basis (Figure 4(d)).
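
The claim about binary data can be illustrated with a small numerical check. The sketch assumes a saturated-linear function that is linear on [0, 1] and flat outside, an untied encoder equal to a scaled identity, an identity decoder, and zero biases; the scale of 3 and the random test data are arbitrary choices made here.

```python
import numpy as np

def sat_linear(z):
    return np.clip(z, 0.0, 1.0)

def sat_linear_c(z):
    # Distance to the nearest flat region (z <= 0 or z >= 1).
    return np.clip(np.minimum(z, 1.0 - z), 0.0, None)

rng = np.random.default_rng(0)
d = 16
X_binary = rng.integers(0, 2, size=(5, d)).astype(float)
X_real = rng.uniform(0.0, 1.0, size=(5, d))

We = 3.0 * np.eye(d)   # scaled identity encoder
Wd = np.eye(d)         # identity decoder

def recon_error_and_penalty(X):
    Z = X @ We.T
    X_rec = sat_linear(Z) @ Wd.T
    return 0.5 * np.sum((X - X_rec) ** 2), np.sum(sat_linear_c(Z))

err_bin, pen_bin = recon_error_and_penalty(X_binary)
err_real, pen_real = recon_error_and_penalty(X_real)
```

For binary inputs every pre-activation is either 0 or 3, so both the reconstruction error and the saturation penalty vanish. For real-valued inputs the same basis no longer reconstructs exactly, which is consistent with the regularizer only changing the solution on non-binary data.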

3.4 Experiments on CIFAR-10 

SATAE auto-encoders with 100 and 300 basis elements were trained on the CIFAR-10 dataset, 
which contains small color images of objects from ten categories. In all of our experiments the auto- 
encoders were trained by progressively increasing the saturation penalty (details are provided in the 
next section). This allowed us to visually track the effect of the saturation penalty on individual 
basis elements. Figure 4(e)-(f) shows the basis learned by SATAE-shrink with small and large
saturation penalty, respectively. Increasing the saturation penalty has the expected effect of reducing 
the number of nonzero activations. As the saturation penalty increases, active basis elements become 
responsible for reconstructing a larger portion of the input. This induces the basis elements to 
become less spatially localized. This effect can be seen by comparing corresponding filters in Figure
4(e) and (f). Figures 4(g)-(h) show the basis elements learned by SATAE-SL with small and large
saturation penalty, respectively. The basis learned by SATAE-SL with a small saturation penalty 
resembles the identity basis, as expected (see previous subsection). Once the saturation penalty 
is increased, small activations become more heavily penalized. To increase their activations the
encoding basis elements may increase in magnitude or align themselves with the input. However, if 
the encoding and decoding weights are tied (or fixed in magnitude) then reconstruction error would 
increase if the weights were merely scaled up. Thus the basis elements are forced to align with the 
data in a way that also facilitates reconstruction. This effect is illustrated in Figure 5, where filters
corresponding to progressively larger values of the regularization parameter are shown. The top
half of the figure shows how an element from the identity basis ($\alpha = 0.1$) transforms to a localized
edge ($\alpha = 0.5$). The bottom half of the figure shows how a localized edge ($\alpha = 0.5$) progressively
transforms to a template of a horse ($\alpha = 1$).

4 Experimental Details 

Because the regularizer explicitly encourages activations in the zero gradient regime of the nonlin- 
earity, many encoder basis elements would not be updated via back-propagation through the non- 
linearity if the saturation penalty were large. In order to allow the basis elements to deviate from 
their initial random states we found it necessary to progressively increase the saturation penalty. In 
our experiments the weights obtained at a minimum of Equation 2 for a smaller value of $\alpha$ were
used to initialize the optimization for a larger value of $\alpha$. Typically, the optimization began with
a small value of $\alpha$, which was progressively increased to $\alpha = 1$ in steps of 0.1. The auto-encoder was trained for
30 epochs at each value of $\alpha$. This approach also allowed us to track the evolution of basis elements
as a function of $\alpha$ (Figure 5). In all experiments data samples were normalized by subtracting the
mean and dividing by the standard deviation of the dataset. The auto-encoders used to obtain the
results shown in Figure 4(a), (c)-(f) used 100 basis elements; the others used 300 basis elements.
Increasing the number of elements in the basis did not have a strong qualitative effect except to make
the features represented by the basis more localized. The decoder basis elements of the SATAEs with

Figure 4: Basis elements learned by the SATAE using different nonlinearities on: 28x28 binary
MNIST digits, 12x12 gray scale natural image patches, and CIFAR-10. (a) SATAE-shrink trained on
MNIST, (b) SATAE-saturated-linear trained on MNIST, (c) SATAE-shrink trained on natural image
patches, (d) SATAE-saturated-linear trained on natural image patches, (e)-(f) SATAE-shrink trained
on CIFAR-10 with $\alpha = 0.1$ and $\alpha = 0.5$, respectively, (g)-(h) SATAE-SL trained on CIFAR-10 with
$\alpha = 0.1$ and $\alpha = 0.6$, respectively.

Figure 5: Evolution of two filters with increasing saturation regularization for a SATAE-SL trained
on CIFAR-10. Filters corresponding to larger values of $\alpha$ were initialized using the filter
corresponding to the previous $\alpha$. The regularization parameter was varied from 0.1 to 0.5 (left to right)
in the top five images and 0.5 to 1 in the bottom five.



shrink and rectified-linear nonlinearities were reprojected to the unit sphere after every 10 stochastic 
gradient updates. The SATAEs which used a saturated-linear activation function were trained with
tied weights. All results presented were obtained using stochastic gradient descent with a constant 
learning rate of 0.05. 
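
The training procedure described in this section can be sketched for the SATAE-shrink case as below. This is an illustrative reconstruction, not the authors' code. The following are assumptions made here: the function name `train_satae_shrink`, the soft-threshold form of shrink with width `lam`, the random initialization, the hand-derived gradients, and the default schedule starting at 0.1. The 30 epochs per value of $\alpha$, the learning rate of 0.05, and the unit-sphere projection every 10 updates follow the text.

```python
import numpy as np

def shrink(z, lam):
    # Assumed soft-threshold form of the shrink nonlinearity.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def train_satae_shrink(X, d_h, lam=0.5, alphas=None, epochs=30,
                       lr=0.05, project_every=10, seed=0):
    """SGD on Equation 2 with a progressively increased saturation penalty."""
    if alphas is None:
        alphas = np.linspace(0.1, 1.0, 10)   # assumed schedule, steps of 0.1
    rng = np.random.default_rng(seed)
    n, d = X.shape
    We = rng.standard_normal((d_h, d)) / np.sqrt(d)   # assumed initialization
    Wd = rng.standard_normal((d, d_h)) / np.sqrt(d)
    be, bd = np.zeros(d_h), np.zeros(d)
    step = 0
    for alpha in alphas:                 # weights carry over to the next alpha
        for _ in range(epochs):
            for i in rng.permutation(n):
                x = X[i]
                z = We @ x + be
                h = shrink(z, lam)
                r = Wd @ h + bd - x      # reconstruction residual
                active = (np.abs(z) > lam).astype(float)
                # Gradient of 0.5*||r||^2 + alpha*sum(|h|) with respect to z;
                # it is zero for units inside the flat region.
                dz = (Wd.T @ r + alpha * np.sign(h)) * active
                Wd -= lr * np.outer(r, h)
                bd -= lr * r
                We -= lr * np.outer(dz, x)
                be -= lr * dz
                step += 1
                if step % project_every == 0:
                    # Reproject decoder basis elements (columns) to unit norm.
                    norms = np.linalg.norm(Wd, axis=0, keepdims=True)
                    Wd /= np.maximum(norms, 1e-12)
    return We, be, Wd, bd
```

The `active` mask makes the motivation for the schedule visible: a unit whose pre-activation sits in the flat region receives no gradient through the nonlinearity, so starting with a large penalty would leave many encoder rows at their initial values. Inputs are expected to be normalized beforehand, for example `X = (X - X.mean()) / X.std()`.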



5 Discussion 

In this work we have introduced a general and conceptually simple latent state regularizer. It was
demonstrated that a variety of feature sets can be obtained using a single framework. The utility of
these features depends on the application. In this section we extend the definition of the saturation
regularizer to include functions without a zero-gradient region. The relationship of SATAEs with 
other regularized auto-encoders will be discussed. We conclude with a discussion on future work. 

5.1 Extension to Differentiable Functions 

We would like to extend the saturation penalty definition (Equation 1) to differentiable functions
without a zero-gradient region. An appealing first guess for the complementary function is some
positive function of the first derivative, $f_c(x) = |f'(x)|$ for instance. This may be an appropriate
choice for monotonic activation functions which have their lowest gradient regions at the extrema
(e.g. sigmoids). However, some activation functions may contain regions of small or zero gradient
which have negligible extent, at the extrema for instance. We would like our definition of the
complementary function to not only measure the local gradient in some region, but to also measure its
extent. For this purpose we employ the concept of average variation over a finite interval. We define
the average variation of $f$ at $x$ in the positive and negative directions at scale $l$, respectively as:



Where $*$ denotes the continuous convolution operator. $\Pi_l^+(x)$ and $\Pi_l^-(x)$ are uniform averaging
kernels in the positive and negative directions, respectively. Next, define a directional measure of
variation of $f$ by integrating the average variation at all scales.

Figure 6: Illustration of the complementary function ($f_c$) as defined by Equation 3 for a
non-monotonic activation function ($f$). The absolute derivative of $f$ is shown for comparison.

Where $w(l)$ is chosen to be a sufficiently fast decreasing function of $l$ to ensure convergence of
the integral. The integral with which $|f'(x)|$ is convolved in the above equation evaluates to some
decreasing function of $x$ for $\Pi^+$ with support $x > 0$. Similarly, the integral involving $\Pi^-$ evaluates
to some increasing function of $x$ with support $x < 0$. This function will depend on $w(l)$. The
functions $M^+ f(x)$ and $M^- f(x)$ measure the average variation of $f(x)$ at all scales $l$ in the positive
and negative direction, respectively. We define the complementary function $f_c(x)$ as:



$$f_c(x) = \min\left(M^+ f(x), M^- f(x)\right). \qquad (3)$$

An example of a complementary function defined using the above formulation is shown in Figure 6.
Whereas $|f'(x)|$ is minimized at the extrema of $f$, the complementary function only plateaus at these
locations.
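
The average variation and the directional measures of variation described in this subsection can be written out as below. This is a sketch of one form consistent with the surrounding description, not a verbatim restoration of the displayed equations: the symbol $\Delta_l^{\pm}$, the integration limits, and the $1/l$ normalization are assumptions, and the sign convention of the averaging kernels is left unspecified.

```latex
% Average variation of f at x, at scale l, in the positive and negative directions
\Delta_l^{+} f(x) = \frac{1}{l} \int_{x}^{x+l} |f'(u)| \, du ,
\qquad
\Delta_l^{-} f(x) = \frac{1}{l} \int_{x-l}^{x} |f'(u)| \, du

% Directional measures of variation: integrate over all scales with weight w(l)
M^{+} f(x) = \int_{0}^{\infty} w(l) \, \Delta_l^{+} f(x) \, dl ,
\qquad
M^{-} f(x) = \int_{0}^{\infty} w(l) \, \Delta_l^{-} f(x) \, dl
```

Under this reading, a point where $|f'|$ vanishes only on a negligible interval still has nonzero $M^{+}$ and $M^{-}$, because both average over a neighborhood rather than measuring the gradient at the point alone.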



5.2 Relationship with the Contractive Auto-Encoder

Let $h_i$ be the output of the $i^{th}$ hidden unit of a single-layer auto-encoder with point-wise nonlinearity
$f(\cdot)$. The regularizer imposed by the contractive auto-encoder (CAE) can be expressed as follows:

$$\sum_{ij} \left( \frac{\partial h_i}{\partial x_j} \right)^2 = \sum_{i}^{d_h} \left( f'\left( \sum_{j=1}^{d} W^e_{ij} x_j + b_i \right)^2 \|W^e_i\|^2 \right)$$

where $x$ is a $d$-dimensional data vector, $f'(\cdot)$ is the derivative of $f(\cdot)$, $b_i$ is the bias of the $i^{th}$
encoding unit, and $W^e_i$ denotes the $i^{th}$ row of the encoding weight matrix. The first term in the above
equation tries to adjust the weights so as to push the activations into the low gradient (saturation)
regime of the nonlinearity, but is only defined for differentiable activation functions. Therefore the
CAE indirectly encourages operation in the saturation regime. Computing the Jacobian, however,
can be cumbersome for deep networks. Furthermore, the complexity of computing the Jacobian is
$O(d \times d_h)$, although a more efficient implementation is possible [3], compared to the $O(d_h)$ for the
saturation penalty.
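
The factored form of the CAE penalty can be checked numerically against a finite-difference Jacobian. The sketch below uses tanh as an example of a differentiable nonlinearity; the sizes, the random weights, and the step `eps` are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 5, 3
We = rng.standard_normal((d_h, d))
b = rng.standard_normal(d_h)
x = rng.standard_normal(d)

def hidden(v):
    return np.tanh(We @ v + b)

# Squared Frobenius norm of the Jacobian dh/dx via central differences;
# this touches all d * d_h entries.
eps = 1e-5
J = np.empty((d_h, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J[:, j] = (hidden(x + e) - hidden(x - e)) / (2.0 * eps)
cae_from_jacobian = np.sum(J ** 2)

# Factored form: sum_i f'(z_i)^2 * ||We[i, :]||^2, with f'(z) = 1 - tanh(z)^2.
fprime = 1.0 - np.tanh(We @ x + b) ** 2
cae_factored = np.sum(fprime ** 2 * np.sum(We ** 2, axis=1))
```

The two quantities agree up to finite-difference error. By contrast, the saturation penalty needs only one evaluation of $f_c$ per hidden unit once the pre-activations are available.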

5.3 Relationship with the Sparse Auto-Encoder

In Section 3.2 it was shown that SATAEs with shrink or rectified-linear activation functions are
equivalent to a sparse auto-encoder. Interestingly, the fact that the saturation penalty happens to
correspond to $L_1$ regularization in the case of SATAE-shrink agrees with the findings in [7]. In their
efforts to find an architecture to approximate inference in sparse coding, Gregor et al. found that
the shrink function is particularly compatible with $L_1$ minimization. Equivalence to sparsity only
for some activation functions suggests that SATAEs are a generalization of sparse auto-encoders. 
Like the sparsity penalty, the saturation penalty can be applied at any point in a deep network for the 
same computational cost. However, unlike the sparsity penalty the saturation penalty is adapted to 
the nonlinearity of the particular layer to which it is applied. 

5.4 Future Work 

We intend to experimentally demonstrate that the representations learned by SATAEs are useful as 
features for learning common tasks such as classification and denoising. We will also address several 
open questions, namely: (i) how to select (or learn) the width parameter ($\lambda$) of the nonlinearity, and
(ii) how to methodically constrain the weights. We will also explore SATAEs that use a wider class 
of non-linearities and architectures. 